Training and test sets¶

We've seen previously how to fit a model to a dataset. In this exercise, we'll be looking at how to check and confirm the validity and performance of our models by using training and testing sets. As usual, we begin by loading in and having a look at our data:

In [1]:
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training.csv
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training-switzerland.csv

data = pandas.read_csv("dog-training.csv", delimiter="\t")

print(data.shape)
print(data.head())

(50, 5)
   month_old_when_trained  mean_rescues_per_year  age_last_year  \
0                      68                   21.1              9   
1                      53                   14.9              5   
2                      41                   20.5              6   
3                       3                   19.4              1   
4                       4                   24.9              4   

   weight_last_year  rescues_last_year  
0              14.5                 35  
1              14.0                 30  
2              17.7                 34  
3              13.7                 29  
4              18.4                 30  

We're interested in the relationship between a dog's weight and the number of rescues it performed in the previous year. Let's begin by plotting rescues_last_year as a function of weight_last_year:

In [2]:
import graphing
import statsmodels.formula.api as smf

# First, we define our formula using a special syntax
# This says that rescues_last_year is explained by weight_last_year
formula = "rescues_last_year ~ weight_last_year"

model = smf.ols(formula = formula, data = data).fit()

graphing.scatter_2D(data, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params[1] * x + model.params[0])

There seems to be a pretty clear relationship between a dog's weight and the number of rescues it's performed. That seems pretty reasonable, as we'd expect heavier dogs to be bigger and stronger and thus better at saving lives!

Train/test split¶

This time, instead of fitting a model to the entirety of our dataset, we're going to separate our dataset into two smaller partitions: a training set and a test set.

The training set is the larger of the two, usually comprising 70-80% of the overall dataset, with the remainder forming the test set.

By splitting our data, we're able to gauge the performance of our model when confronted with previously unseen data.

Notice that data on the test set is never used in training. For that reason, it's commonly referred to as unseen data or data that is unknown by the model.

In [3]:
from sklearn.model_selection import train_test_split


# Obtain the label and feature from the original data
dataset = data[['rescues_last_year','weight_last_year']]

# Split the dataset into a 70/30 train/test ratio; the rows keep their original indices from the dataset
train, test = train_test_split(dataset, train_size=0.7, random_state=21)

print("Train")
print(train.head())
print(train.shape)

print("Test")
print(test.head())
print(test.shape)
Train
    rescues_last_year  weight_last_year
33                 30              19.4
0                  35              14.5
13                 36              19.5
28                 31              16.1
49                 37              23.0
(35, 2)
Test
    rescues_last_year  weight_last_year
7                  37              17.1
44                 25              15.4
43                 26              20.0
25                 32              22.2
14                 32              18.3
(15, 2)

We notice that these sets are different, and that the training set and test set contain 70% and 30% of the overall data, respectively.

Let's have a look at how the training set and test set are separated out:

In [4]:
# You don't need to understand this code well
# It's just used to create a scatter plot

# concatenate training and test so they can be graphed
plot_set = pandas.concat([train,test])
plot_set["Dataset"] = ["train"] * len(train) + ["test"] * len(test)

# Create graph
graphing.scatter_2D(plot_set, "weight_last_year", "rescues_last_year", "Dataset", trendline = lambda x: model.params[1] * x + model.params[0])

Training Set¶

We begin by fitting the model to the training set, then evaluate its performance on that same training set:

In [5]:
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error as mse

# First, we define our formula using a special syntax
# This says that rescues_last_year is explained by weight_last_year
formula = "rescues_last_year ~ weight_last_year"

# Create and train the model
model = smf.ols(formula = formula, data = train).fit()

# Graph the result against the data
graphing.scatter_2D(train, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params[1] * x + model.params[0])

We can gauge our model's performance by calculating the mean squared error (MSE).

In [6]:
# We use the built-in sklearn function to calculate the MSE
correct_labels = train['rescues_last_year']
predicted = model.predict(train['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)
MSE = 18.674546 
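For intuition, the MSE is nothing more than the mean of the squared differences between the correct labels and the predictions. A minimal standalone check (using a tiny made-up set of labels and predictions, not the dog data) shows this matches sklearn's function:

```python
from sklearn.metrics import mean_squared_error

# Tiny made-up example: three labels and three predictions
labels = [30, 35, 28]
predictions = [32.0, 33.0, 29.0]

# Compute MSE by hand: the mean of the squared residuals
manual_mse = sum((l - p) ** 2 for l, p in zip(labels, predictions)) / len(labels)

print(manual_mse)                               # 3.0
print(mean_squared_error(labels, predictions))  # 3.0
```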

Test Set¶

Next, we test the same model's performance using the test set:

In [7]:
graphing.scatter_2D(test, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params[1] * x + model.params[0])

Let's have a look at the MSE again.

In [8]:
correct_labels = test['rescues_last_year']
predicted = model.predict(test['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)
MSE = 24.352949 

We learn that the model performs better on the known training data than on the unseen test data (remember that higher MSE values are worse).

There can be a number of reasons for this, but first and foremost is overfitting: the model matches the data in the training set too closely. An overfit model performs very well on the training set, but doesn't generalize well (that is, it won't work well with other datasets).
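To see overfitting in isolation, here is a small synthetic sketch (not part of this exercise's data): a very flexible degree-15 polynomial fitted to noisy, fundamentally linear points achieves a lower training MSE than a straight line, but a worse test MSE, because it has memorized the training noise:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic data: a straight line plus noise
x = np.linspace(0, 10, 30)
y = 2 * x + 5 + rng.normal(0, 2, size=x.size)

# Random 70/30 train/test split by index
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:21], idx[21:]

# Fit a simple line and a very flexible degree-15 polynomial to the training set
line = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], deg=1)
poly = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], deg=15)

for name, model in [("line", line), ("degree-15 polynomial", poly)]:
    train_mse = mean_squared_error(y[train_idx], model(x[train_idx]))
    test_mse = mean_squared_error(y[test_idx], model(x[test_idx]))
    print(f"{name}: train MSE = {train_mse:.2f}, test MSE = {test_mse:.2f}")
```

The polynomial's low training error is deceptive; the gap between its train and test MSE is the signature of overfitting that the train/test split is designed to expose.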

New Dataset¶

To illustrate our point further, let's have a look at how our model performs when confronted with a completely new, unseen, and larger dataset. For our scenario, we'll use data provided by the avalanche rescue charity's European branch.

In [9]:
# Load an alternative dataset from the charity's European branch
new_data = pandas.read_csv("dog-training-switzerland.csv", delimiter="\t")

print(new_data.shape)
new_data.head()
(500, 5)
Out[9]:
   month_old_when_trained  mean_rescues_per_year  age_last_year  weight_last_year  rescues_last_year
0                       9                   16.7              2         15.709342                 30
1                      33                   24.2              8         14.760819                 35
2                      43                   20.2              4         13.118374                 19
3                      37                   19.2              5         10.614075                 24
4                      45                   16.9              8         17.519890                 28

The features are the same, but we have much more data this time. Let's see how our model does!

In [10]:
# Plot the fitted model against this new dataset. 

graphing.scatter_2D(new_data, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params[1] * x + model.params[0])

And now, the MSE:

In [11]:
correct_labels = new_data['rescues_last_year']
predicted = model.predict(new_data['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)
MSE = 20.406905 

As expected, the model performs better on the training dataset than it does on this unseen dataset. This is due to overfitting, as we noted previously.

Interestingly, the model performs better on this unseen dataset than it does on the test set. This is because our previous test set was quite small, and thus not a very good representation of real-world data. By contrast, this unseen dataset is large and a much better representation of data we'll find outside of the lab. In essence, this shows us that part of the performance difference we see between training and test is due to model overfitting, and part of the error is due to the test set not being perfectly representative. In the next exercises, we'll explore the trade-off we have to make between training and test dataset sizes.
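The sensitivity of the test MSE to the particular split can be seen directly by repeating it with different random seeds. A minimal synthetic sketch (using sklearn's LinearRegression rather than statsmodels, and made-up data in roughly the same ranges as the dog dataset) shows how much the test MSE moves around on a 50-row dataset purely as a result of which rows land in the test set:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)

# Synthetic stand-in for the dog data: one feature, a linear relationship plus noise
X = rng.uniform(10, 25, size=(50, 1))
y = 1.5 * X[:, 0] + 5 + rng.normal(0, 4, size=50)

# Repeat the 70/30 split with different seeds and record the test MSE each time
test_mses = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.7, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print(f"test MSE across 20 splits: min {min(test_mses):.1f}, max {max(test_mses):.1f}")
```

The spread between the smallest and largest test MSE is exactly the "imperfect test set" effect described above: with only 15 test rows, a single lucky or unlucky split noticeably shifts the measured error.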

Summary¶

In this exercise, we covered the following concepts:

  • Splitting a dataset into a training set and a test set
  • Training a model using the training set, then testing its performance on the training set, the test set, and a new, unseen dataset
  • Comparing the respective MSEs to highlight the effects and dangers of overfitting